






#### Search Detail

#### **Submittal Details**

**Document Info** 

Title: Completing the Journey of Moore's Law

Document Number: 5221921 SAND Number: 2004-1890 P
Review Type: Electronic Status: Approved

Sandia Contact: DEBENEDICTIS, ERIK P. Submittal Type: Viewgraph/Presentation

Requestor: DEBENEDICTIS, ERIK P. Submit Date: 05/03/2004

Author(s)

DEBENEDICTIS, ERIK P.

Event (Conference/Journal/Book) Info

Name: Seminar at University of Notre Dame

City: South Bend State: IN Country: USA

Partnership Info

Partnership Involved: No

Partner Approval : Agreement Number :

Patent Info

Scientific or Technical in Content: Yes

Technical Advance : No TA Form Filed : No

SD Number:

Classification and Sensitivity Info

Additional Limited Release Info: None.

DUSA: None.

#### **Routing Details**

| Role                           | Routed To          | Approved By        | Approval Date |
|--------------------------------|--------------------|--------------------|---------------|
|                                |                    |                    |               |
| Derivative Classifier Approver | YARRINGTON,PAUL    | YARRINGTON, PAUL   | 05/03/2004    |
| Conditions:                    |                    |                    |               |
| Classification Approver        | WILLIAMS,RONALD L. | WILLIAMS,RONALD L. | 05/04/2004    |
| Conditions:                    |                    |                    |               |
| Manager Approver               | PUNDIT,NEIL D.     | PUNDIT,NEIL D.     | 05/04/2004    |
| Conditions:                    |                    |                    |               |
| Administrator Approver         | LUCERO,ARLENE M.   | LUCERO, ARLENE M.  | 05/14/2004    |
| printed 5/14/2004 (al)         |                    |                    |               |




#### **SAND 2004-1890 P**

# Completing the Journey of Moore's Law

Presentation at Notre Dame May 4, 2004

Erik P. DeBenedictis
Sandia National Laboratories









#### **Outline**

- Applications of the Future
- Limits of Moore's Law
- How to Reach the Limit
  - Aerogel model
  - Applications Modeling
- No Need For a Breakthrough
- Architecture
- Beyond Moore's Law



# Simulation of Physics on a Computer

- Space is divided into cells, each with computer variables representing the physical state of the volume represented by the cell
- The computer updates the state of a cell for successive time intervals ∆T based on some physical laws
- I.e., S<sub>ijk</sub>' = f(S<sub>ijk</sub>, states of nearby cells)
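A minimal C++ sketch of this update rule, assuming one scalar state per cell. The `Grid` type, the `step` function, and the choice of f (the mean of a cell and its six face neighbors) are illustrative assumptions, not taken from the talk; a real code would substitute its own physics for f.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical n x n x n grid of scalar cell states, stored as a flat array.
struct Grid {
    std::size_t n;
    std::vector<double> s;
    explicit Grid(std::size_t n_) : n(n_), s(n_ * n_ * n_, 0.0) {}
    double& at(std::size_t i, std::size_t j, std::size_t k) { return s[(i * n + j) * n + k]; }
    double at(std::size_t i, std::size_t j, std::size_t k) const { return s[(i * n + j) * n + k]; }
};

// One time step: S'_ijk = f(S_ijk, states of nearby cells).
// Assumed placeholder for f: the mean of the cell and its six face neighbors.
// Boundary cells are simply held fixed.
Grid step(const Grid& old) {
    assert(old.n >= 3);
    Grid next = old;
    const std::size_t n = old.n;
    for (std::size_t i = 1; i + 1 < n; ++i)
        for (std::size_t j = 1; j + 1 < n; ++j)
            for (std::size_t k = 1; k + 1 < n; ++k)
                next.at(i, j, k) = (old.at(i, j, k) +
                                    old.at(i - 1, j, k) + old.at(i + 1, j, k) +
                                    old.at(i, j - 1, k) + old.at(i, j + 1, k) +
                                    old.at(i, j, k - 1) + old.at(i, j, k + 1)) / 7.0;
    return next;
}
```

Each call to `step` reads only the old grid and writes a new one, so every cell's update depends on the previous time interval alone.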





# **Fourth Power Scaling Rule**





# **Example: Earthquake Risk Mitigation**

- In an earthquake-prone region
  - Some areas the size of city blocks shake a lot
  - Others are stable
- This effect is caused by focusing or deflection of seismic waves by underground rock structure
- Mitigation
  - Identify dangerous areas and avoid building there
  - Identify dangerous areas by simulating many typical earthquakes and noting the shaking
  - Requires an image of the underground rock structure



# **Example Application: Earthquake Mitigation**

#### Forward simulation

- Match the results of a seismic simulation with observed data from seismographs

#### Imaging

- Deduce the structure of rock under the region (imaging) by repeatedly simulating the error from forward simulation by adjoint methods

number 1, in press (2004)



# **Example: Earthquake Risk Mitigation**

#### Today

- Codes run at Caltech and the Pittsburgh Supercomputing Center
- Uses frequencies to 1 Hz, or a wavelength of several miles in rock
- Computers are about 5 Teraflops

#### Limit

- Seismographs collect data to 20 Hz or more, or hundreds of feet in rock
- Buildings are hundreds of feet in size, so this is useful resolution
- Required computer: 5 Teraflops × 20<sup>4</sup> ≈ 1 Exaflops
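The arithmetic behind this estimate can be checked in a few lines of C++. The sketch assumes the fourth-power rule means that refining resolution by a factor r multiplies the work by r<sup>4</sup> (three space dimensions plus the time step); the function name is illustrative.

```cpp
#include <cassert>
#include <cmath>

// Scale a baseline FLOPS rate by the fourth power of the resolution factor
// (assumed reading of the fourth-power scaling rule: 3 space dimensions + time).
double requiredFlops(double baselineFlops, double resolutionFactor) {
    assert(baselineFlops > 0 && resolutionFactor > 0);
    return baselineFlops * std::pow(resolutionFactor, 4.0);
}
```

With the slide's figures, `requiredFlops(5e12, 20.0)` gives 8 × 10<sup>17</sup> FLOPS, which the slide rounds to 1 Exaflops.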





# **Earthquake Risk Mitigation**

Algorithms: Written

Code: Runs

Input Data: Exists

Consequence of Not Proceeding: People Die

Required FLOPS: 1E = 1000P = 1,000,000T

- 25,000 × Earth Simulator





#### **Global Climate**

- Objective
  - Collect data about Earth
  - Model climate into the future
  - Provide "decision support" and ability to "mitigate"
- Approaches
  - Climate models exist, but they need more resolution, better physics, and better initial conditions (observations of the Earth)
- Computer Resources Required
  - Increments over current workstation on next slide



## **FLOPS Increases for Global Climate**

| FLOPS         | Issue                                                                    | Motivation                  |
|---------------|--------------------------------------------------------------------------|-----------------------------|
| 1 Zetaflops   | Ensembles, scenarios: 10×                                                | Range of model variability  |
| 100 Exaflops  | Run length: 100×                                                         | Long-term implications      |
| 1 Exaflops    | New parameterizations: 100×                                              | Upgrade to "better" science |
| 10 Petaflops  | Model completeness: 100×                                                 | Add "new" science           |
| 100 Teraflops | Spatial resolution: 10<sup>4</sup>× (10<sup>3</sup>× to 10<sup>5</sup>×) | Provide regional details    |
| 10 Gigaflops  | Current                                                                  |                             |
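The table reads from the bottom up: each row multiplies the FLOPS of the row below it by that row's factor. A small C++ sketch of that running product follows; the `Increment` struct and `cumulativeFlops` function are illustrative names, and the factors are the nominal ones from the table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One row of the table: the issue addressed and its nominal FLOPS multiplier.
struct Increment {
    const char* issue;
    double factor;
};

// Nominal factors read off the table, bottom row first.
const std::array<Increment, 5> kIncrements = {{
    {"Spatial resolution", 1e4},
    {"Model completeness", 1e2},
    {"New parameterizations", 1e2},
    {"Run length", 1e2},
    {"Ensembles, scenarios", 1e1},
}};

// FLOPS required after applying the first `steps` increments to `startFlops`.
double cumulativeFlops(double startFlops, std::size_t steps) {
    assert(steps <= kIncrements.size());
    double flops = startFlops;
    for (std::size_t i = 0; i < steps; ++i) flops *= kIncrements[i].factor;
    return flops;
}
```

Starting from the 10 Gigaflops workstation, one step reaches 100 Teraflops and all five steps reach 10<sup>21</sup> FLOPS, i.e. 1 Zetaflops.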





#### **Outline**

- Applications of the Future
- Limits of Moore's Law
- How to Reach the Limit
  - Aerogel model
  - Applications Modeling
- No Need For a Breakthrough
- Architecture
- Beyond Moore's Law









## **Thermal Noise Limit**

This logical irreversibility is associated with physical irreversibility and requires a minimal heat generation, per machine cycle, typically of the order of kT for each irreversible function.

- R. Landauer 1961



Irreversibility and Heat Generation in the Computing Process

kT "helper line," drawn out of the reader's focus because it wasn't important at the time of writing

- Carver Mead, Scaling of MOS Technology, 1994



# Metaphor to FM Radio on Trip to Chicago

- You drive to Chicago listening to FM radio
- Music clear for a while, but noise creeps in and then overtakes music
- Why?
  - Signal at antenna weakens
  - Thermal electron noise constant at k<sub>B</sub> T

- Analogy: You live out the next dozen years buying PCs every couple years
- Electrical effect
  - Moore's Law causes switching energy of gates to decrease at about 30% per year
  - Thermal electron noise constant at k<sub>B</sub> T
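A short C++ sketch of what a 30% per year decline implies. The starting point of about 100,000 k<sub>B</sub>T per switching event is an assumption taken from the "Today" entry later in this talk, and the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Switching energy, in units of k_B*T, after `years` of shrinking by
// `ratePerYear` (0.3 = 30% per year) from `startKT`.
double switchingEnergyKT(double startKT, double ratePerYear, double years) {
    assert(ratePerYear > 0 && ratePerYear < 1);
    return startKT * std::pow(1.0 - ratePerYear, years);
}

// Years until the switching energy has fallen from `startKT` to `floorKT`.
double yearsToReach(double startKT, double floorKT, double ratePerYear) {
    assert(ratePerYear > 0 && ratePerYear < 1);
    assert(startKT > floorKT && floorKT > 0);
    return std::log(startKT / floorKT) / -std::log(1.0 - ratePerYear);
}
```

Under these assumptions, twelve years of decline (2004 to 2016) takes 100,000 k<sub>B</sub>T down to roughly 1,400 k<sub>B</sub>T, and a 100 k<sub>B</sub>T floor would be reached after about 19 years, while the noise level k<sub>B</sub>T stays where it is.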

Details: Erik DeBenedictis, "Taking ASCI Supercomputing to the End Game," SAND2004-0959



## FM Radio and End of Moore's Law



- Driving away from FM transmitter → less signal; noise from electrons → no change
- Increasing numbers of gates → less signal power; noise from electrons → no change



# **Amount of Reliability Needed**

- We expect computers to be reliable
- A future supercomputer will perform 10<sup>30</sup>-10<sup>40</sup> operations in its lifetime
- Error rate should be < 10<sup>-30</sup> to 10<sup>-40</sup>
- Reliability due to thermal noise about e<sup>-E/k<sub>B</sub>T</sup>
- Need about e<sup>-100</sup> error rate, or 100 k<sub>B</sub>T switching energy

| SNR (dB) | Power Ratio | P<sub>error</sub>         | Note        |
|----------|-------------|---------------------------|-------------|
| 10       | 10          | 3.9×10<sup>-6</sup>       |             |
| 14       | 25          | 6.8×10<sup>-13</sup>      |             |
| 18       | 63          | 1.4×10<sup>-29</sup>      |             |
| 22       | 160         | 3.3×10<sup>-71</sup>      | Noise limit |
| 26       | 400         | 1.8×10<sup>-175</sup>     |             |
| 30       | 1,000       | 4.5×10<sup>-437</sup>     | 2016        |
| 34       | 2,500       | 7.1×10<sup>-1094</sup>    |             |
| 38       | 6,300       | 2.2×10<sup>-2743</sup>    |             |
| 42       | 16,000      | 1.8×10<sup>-6886</sup>    |             |
| 46       | 40,000      | 3.8×10<sup>-17293</sup>   |             |
| 50       | 100,000     | 3.2×10<sup>-43433</sup>   | Today       |
| 54       | 250,000     | 8.1×10<sup>-109094</sup>  |             |
| 58       | 630,000     | 1.8×10<sup>-274025</sup>  |             |
| 62       | 1,500,000   | 9.6×10<sup>-688315</sup>  |             |
$$q = \int_{t}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^{2}}{2}} \, dx, \qquad t = \sqrt{2 \times 10^{\frac{SNR}{10}}}$$
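The table entries can be reproduced from this integral. The sketch below assumes the exponent is SNR/10 with SNR in decibels, so that the power ratio is R = 10<sup>SNR/10</sup> and t = √(2R); the function names are illustrative. Because the probabilities underflow a `double` beyond the first few rows, a second function returns log<sub>10</sub> of the leading asymptotic term instead.

```cpp
#include <cassert>
#include <cmath>

// Power ratio corresponding to an SNR given in decibels.
double powerRatio(double snrDb) {
    return std::pow(10.0, snrDb / 10.0);
}

// Gaussian tail probability q for threshold t = sqrt(2 * R).
// Uses Q(t) = 0.5 * erfc(t / sqrt(2)), which here is 0.5 * erfc(sqrt(R)).
double errorProbability(double snrDb) {
    assert(snrDb >= 0);
    return 0.5 * std::erfc(std::sqrt(powerRatio(snrDb)));
}

// log10 of q from the leading asymptotic term q ~ exp(-R) / (2 * sqrt(pi * R)),
// usable for large SNR where q itself underflows double precision.
double log10ErrorProbability(double snrDb) {
    assert(snrDb >= 0);
    const double pi = 3.14159265358979323846;
    const double r = powerRatio(snrDb);
    return -r * std::log10(std::exp(1.0)) - std::log10(2.0 * std::sqrt(pi * r));
}
```

At 10 dB `errorProbability` gives about 3.9 × 10<sup>-6</sup>, and at 50 dB `log10ErrorProbability` gives about -43432.5, i.e. 3.2 × 10<sup>-43433</sup>, matching the table.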





## **Noise Levels**

- 0 dB Limit of hearing
- 20 dB Rustling leaves
- 40-50 dB Typical neighborhood
- 60-70 dB Normal conversation
- 80 dB Telephone dial tone
- 85 dB City traffic inside car
- 90 dB Train whistle @500'
- 95 dB Subway train @200'
- 90-95 dB Ear damage

- Today: 50 dB
  - Thermal noise : Logic :: Rustling leaves : Talking
- 2016: 30 dB
  - Thermal noise : Logic :: Talking : Train whistle
- Reliability limit: 20 dB
  - Thermal noise : Logic :: Outside neighborhood : Talking



## **Personal Observational Evidence**

- Have radios become better able to receive distant stations over the last few decades with a rate of improvement similar to Moore's Law?
- You judge from your experience, but the answer should be that they have not.
- Therefore, electrical noise does not scale with Moore's Law.





- Generalization of Moore's Law
  - Projects many parameters
  - Years through 2016
  - Includes justification
  - Panel of experts
    - known to be wrong
  - Size between Albuquerque white and yellow pages





# **Semiconductor Roadmap**

| Year of Production                                                                                                                     | 2010 | 2013    | 2016    |
|----------------------------------------------------------------------------------------------------------------------------------------|------|---------|---------|
| DRAM ½ Pitch (nm)                                                                                                                      |      | 32      | 22      |
| MPU/ASIC ½ Pitch (nm)                                                                                                                  |      | 33      | 23      |
| MPU Printed Gate Length (nm)                                                                                                           |      | 18      | 13      |
| MPU Physical Gate Length (nm)                                                                                                          |      | 13      | 9       |
| Physical gate length high-performance (HP) (nm) [1]                                                                                    |      | 13      | 9       |
| Equivalent physical oxide thickness for high-performance T<sub>ox</sub> (EOT) (nm) [2]                                                 |      | 0.4-0.6 | 0.4-0.5 |
| Gate depletion and quantum effects electrical thickness adjustment factor (nm) [3]                                                     |      | 0.5     | 0.5     |
| T<sub>ox</sub> electrical equivalent (nm) [4]                                                                                          |      | 1.0     | 0.9     |
| Nominal power supply voltage (V<sub>dd</sub>) (V) [3]                                                                                  |      | 0.5     | 0.4     |
| Nominal high-performance NMOS sub-threshold leakage current, I<sub>sd,leak</sub> (at 25°C) (µA/µm) [6]                                 |      | 7       | 10      |
| Nominal high-performance NMOS saturation drive current, I<sub>dd,sat</sub> (at V<sub>dd</sub>, at 25°C) (µA/µm) [7]                    |      | 1500    | 1500    |
| Required percent current-drive "mobility/transconductance improvement" [8]                                                             |      | 70%     | 100%    |
| Parasitic source/drain resistance (R<sub>sd</sub>) (ohm-µm)                                                                            |      | 90      | 80      |
| Parasitic capacitance percent of ideal gate capacitance                                                                                | 25%  | 30%     | 35%     |
|                                                                                                                                        | 31%  | 36%     | 42%     |
| High-performance NMOS device τ (C<sub>gate</sub> × V<sub>dd</sub> / I<sub>dd</sub>-NMOS) (ps) [12]                                     |      | 0.22    | 0.15    |
| Relative device performance [13]                                                                                                       |      | 7.2     | 10.7    |
| Energy per (W/L<sub>gate</sub>=3) device switching transition (C<sub>gate</sub> × (3 × L<sub>gate</sub>) × V<sup>2</sup>) (fJ/Device) [14] |      | 0.007   | 0.002   |
| Static power dissipation per (W/L<sub>gate</sub>=3) device (Watts/Device) [13]                                                         |      | 1.4E-07 | 1.1E-07 |

Annotation: 1,000 k<sub>B</sub>T/transistor

White—Manufacturable Solutions Exist, and Are Being Optimized Yellow—Manufacturable Solutions are Known Red—Manufacturable Solutions are NOT Known



# **Limits for a Red Storm-Sized Computer**





#### **Outline**

- Applications of the Future
- Limits of Moore's Law
- How to Reach the Limit
  - Aerogel model
  - Applications Modeling
- No Need For a Breakthrough
- Architecture
- Beyond Moore's Law





## Can We Reach the Limit?

- Method: Compare modeled running time on perfect computer to real computer
- Application: Local calculations with global time step (SOR)
- Technology comparison:
  - 22 nm transistors with 3D atom-by-atom assembly
  - Our best shot at an architecture
- Definition of Success: Our best shot comes within a constant factor of the theoretical peak





# **Aerogel Computer**

- Devise algorithm for a hypothetical aerogel computer
  - Cell may be gate
  - Cell may be memory
  - Is space for cooling, but no cooling
- Model application runtime
- Engineer real computer
- Model application runtime
- If runtimes similar, you succeeded



Element = Bit of memory or part of logic gate (transistor)





# **Aerogel Cooling**

- Inflate aerogel computer to the point where heat emerging from faces is less than the capacity of a designated cooling system
  - Air: 45 kW/m²
  - Water: 62 MW/m²
  - Pulse: ∞ W/m²







# **Architecture Target**





# **Global Synchronization**







# **Application Modeling**

- Sample Problem
  - 3D finite difference equation with global synchronization
  - SOR method

$$T_{Step} = \frac{K \times F_{cell}}{floprate} + T_{Global}$$

- where
  - K is memory size
- Global synchronization limited by speed of light

$$T_{Global} \ge \frac{2\sqrt{3} \times L_{Edge}}{c}$$

- where
  - L<sub>Edge</sub> is edge dimension of cube

$$6 \times L_{Edge}^2 \times C_{x} \geq Power$$
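The three relations on this slide can be sketched as small C++ functions. This is not the talk's actual model: the function names are illustrative, and C<sub>x</sub> is read here as cooling capacity per unit face area in W/m², an assumption based on the Aerogel Cooling slide.

```cpp
#include <cassert>
#include <cmath>

constexpr double kSpeedOfLight = 299792458.0;  // m/s

// Lower bound on global synchronization time for a cube with edge lEdge (m):
// T_Global >= 2 * sqrt(3) * L_Edge / c
double minGlobalSyncTime(double lEdge) {
    assert(lEdge > 0);
    return 2.0 * std::sqrt(3.0) * lEdge / kSpeedOfLight;
}

// Time per step: T_Step = K * F_cell / floprate + T_Global
double stepTime(double k, double fCell, double flopRate, double tGlobal) {
    assert(flopRate > 0);
    return k * fCell / flopRate + tGlobal;
}

// Power (W) that the six faces of the cube can shed at coolingPerArea (W/m^2),
// i.e. the quantity 6 * L_Edge^2 * C_x that is compared against system power.
double faceCoolingCapacity(double lEdge, double coolingPerArea) {
    assert(lEdge > 0 && coolingPerArea >= 0);
    return 6.0 * lEdge * lEdge * coolingPerArea;
}
```

For a 1 m cube the synchronization bound is about 11.6 ns, and with air cooling at 45 kW/m² the six faces can carry away 270 kW.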



# **Actual Applications Modeling**

- Actual code was several hundred lines of C++
- Theoretical limit covered
  - Coolant
- Realistic covered
  - Layout on a 2D surface of a particular size
  - Heat sink limits
  - I/O bandwidth from chip
  - Coolant

```
// Excerpts from the model source; n, K, FloatCells, FloatTau, LogicProcess,
// Magic, and similar names are defined elsewhere in the full program.

// Physical Constants
double kB = 1.3806503e-23;               // Boltzmann's constant J/K
double T = 300;                          // room temperature K
double c = 299792458;                    // speed of light m/s
double MetersPerFoot = 2.54*12/100;

// Parameters that could be static
double HSSGBits = 40e9;                  // HSS speed (bits/s)
double ChipArea = .02 * .02;             // Nominal area of a chip = 2 cm x 2 cm = 400mm^2 (m^2)
//double ChipArea = 140e-6;              // MPU High Volume per ITRS 1h 2002 (m^2)
//double ChipArea = 572e-6;              // ASIC maximum chip size at production per ITRS 1j 2002 (m^2)
double FloatBits = 64;                   // number of bits per floating point number (bits)
double GrindFLOPS = 9;                   // number of flops per SOR update (floating ops)
double RentalCostSquareFootPerYear = 12; // rental cost of real estate ($ per square foot per year)
double CostPerChip = 1000;               // purchase price per chip in a system ($)
double KWHCost = .15;                    // price per kilowatt-hour of electricity ($/KWH)
double DepreciationFactor = .3;          // fraction of HW cost to amortize per year
double FracSpeedOfLight = .1;            // signal propagation velocity as fraction of c
double WordsPerMemory = 1000;            // number of words in primitive memory

double TotalNodes = n*n*n/K;
double SystemMemoryBits = FloatBits*n*n*n;
double SystemCPUGates = FloatCells*TotalNodes;
double TotalCells = SystemMemoryBits + SystemCPUGates;
double MeshUpdateTime = GrindFLOPS*K*FloatTau*LogicProcess.Tau;
double PropagationVelocity = Magic ? c : FracSpeedOfLight*c; // speed of signal propagation

// FLEETZero branchmerge
// properties for the branch-merge circuit down to WordsPerMemory word memories
double BranchMergePerNode = ceil(K/WordsPerMemory)-1;
double FastBranchMergePerNode = min(BranchMergePerNode, 31);
double SystemFastBranchMergeGates = TotalNodes * 30*FastBranchMergePerNode*FloatBits; // gates per bit * 64 bits

ComputerInstance Test = *this;
// Fraction of chip area occupied, rest will be left empty
Test.FractionChipOccupancy = tTransistorsPerChip/MaxTransistorsPerChip;
Test.FacePowerDensity = Test.SystemPower/6/Test.LEdge/Test.LEdge/
double SquareFeetFloor = SystemVolumeCubicFeet/8*2;
```



# Performance on Sample Problem





## **Cost Efficiency**



Memory/Node



#### **Outline**

- Applications of the Future
- Limits of Moore's Law
- How to Reach the Limit
  - Aerogel model
  - Applications Modeling
- No Need For a Breakthrough
- Architecture
- Beyond Moore's Law



# **Example of Computer at Physics Limit**

- Sandia is often approached by people who say we need some elaborate technology in order to run our applications at the Petaflops level
  - Do we need elaborate technology?
  - Is the person just looking for research funding?
- Question: can we make a computer that runs at the limits out of inexpensive components?
  - Yes, subsequent slides are an example







#### **Outline**

- Applications of the Future
- Limits of Moore's Law
- How to Reach the Limit
  - Aerogel model
  - Applications Modeling
- No Need For a Breakthrough
- Architecture
- Beyond Moore's Law





#### Which Microarchitecture?

- Task: Pick a winner
  - Candidates: μP, PIM, vector, FPGA, reconfigurable, streaming, maybe more
  - Each has advantages
  - Not clear which is best
  - Government gets bad press for picking winners too early

- Why do we pick winners?
  - Logic is a scarce resource
  - But hang on a minute, don't we have more transistors than we know what to do with, and even turn some off at times?
- Can we change the rules of the game to make NOT picking a winner a virtue?





#### Multi-Architecture Idea

Architecture to comprise

- μP and accelerator architectures 1 and 2
- Power control circuit so only one is turned on at a time

- Benefit
  - Can expect support from cluster community and advocates of architectures 2 and 3
- Arch2=Vector, Arch3=PIM?







#### **Outline**

- Applications of the Future
- Limits of Moore's Law
- How to Reach the Limit
  - Aerogel model
  - Applications Modeling
- No Need For a Breakthrough
- Architecture
- Beyond Moore's Law



## **Beyond Moore's Law**





#### **Reversible Logic**

- Reversible logic dissipates energy through "friction"
- If you run reversible logic at speed ∝ 1/n, it will dissipate power ∝ 1/n<sup>2</sup>
- However, any design will have a parasitic power loss, so actual loss is not ∝ 1/n<sup>2</sup>, but

$$Power = \frac{P_0}{n^2} + P_{parasitic}$$

Measured power down 4×, limit 2000×





- 8×8 Multiplier Designed, Fabricated, and Tested by IBM & University of Michigan
- Power savings was up to 4:1



#### **Reversible Microprocessor Status**

#### Status

- Subject of Ph. D. thesis
- Chip laid out (no floating point)
- RISC instruction set
- C-like language
- Compiler
- Demonstrated on a PDE
- However: really weird and not general to program with +=, -=, etc. rather than =





#### **Thought Model for Reversible Red Storm**

- Replace each Red Storm node with chips constructed from n<sup>2</sup> ≅ 1000 layers of reversible logic operating at 1/n ≅ 1/30 speed
- Overall system 30× faster, same power, 1000× nodes

Figure: a conventional chip has an active area O(1 μm) thick on its substrate; the reversible chip has an active area of n² ≅ 1000 layers on its substrate. Will become feasible for small "line width".



## **Thought Model for Reversible Red Storm**

|             | Conventional Logic Red Storm | Reversible n=30 Red Storm |
|-------------|------------------------------|---------------------------|
| Nodes       | 10,000                       | 10,000,000                |
| FLOPS/node  | 4 Gigaflops                  | 100 Megaflops             |
| Total FLOPS | 40 Teraflops                 | 1 Petaflops               |
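The table follows from two scalings in the thought model: each node is replaced by about n<sup>2</sup> layers, and each layer runs at 1/n speed. Since a layer at 1/n speed dissipates about 1/n<sup>2</sup> of the power, total power is unchanged. A minimal C++ sketch, with an illustrative `System` struct and function name:

```cpp
#include <cassert>

// Illustrative description of a machine: node count and per-node speed.
struct System {
    double nodes;
    double flopsPerNode;
    double totalFlops() const { return nodes * flopsPerNode; }
};

// Replace each node with `layers` reversible nodes running at 1/n speed.
// The thought model takes layers ~ n^2 so that total power stays the same.
System reversibleVariant(const System& base, double n, double layers) {
    assert(n > 0 && layers > 0);
    return System{base.nodes * layers, base.flopsPerNode / n};
}
```

With n = 30 and 1000 layers, a 10,000-node, 4 Gigaflops-per-node machine becomes 10,000,000 nodes at about 133 Megaflops each, roughly 1.3 Petaflops in total; the table rounds these to 100 Megaflops and 1 Petaflops.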





#### **Summary**

- Applications based on "simulating physics on a computer" scale up quite a ways
  - Gave examples at 1 Exaflops and 1 Zetaflops
- Semiconductor roadmap comes pretty close to physical limits for current class of computers
  - Had chart with numerical FLOPS targets
  - Microprocessors cost about 100×
- Other classes of computers are possible, but introduce disruptive change





#### **Ideas for Future Work**

- For computer architecture and software
  - Show scalability to the physical limits, but not beyond
- Estimate FLOPS for important problems to society that can be solved with computers
  - Which will be solvable with a computer of the current class, but scaled by Moore's Law?
  - Which will require a new class of computer?
    - These problems create a mandate for research into new classes of computer





# **Backup**



# **General Specifications at Physics Limit**

|             | Red Storm | Limit μP Mode | Limit Turbo Mode          |
|-------------|-----------|---------------|---------------------------|
| Nodes       | 10,000    | 200,000       | 2,000,000                 |
| Node Type   | μP        | μP            | TBD – say 10 vector pipes |
| Clock       | 2 GHz     | 20 GHz        | 20 GHz                    |
| Flops/node  | 4 GFLOPS  | 40 GFLOPS     | 400 GFLOPS                |
| Sys. Peak   | 40 TFLOPS | 8 PFLOPS      | 800 PFLOPS                |
| MPI Latency | 2.5 μs    | 100 ns        | N/A – no MPI              |
| Power       | 2 MW      | 2 MW          | 2 MW                      |



## The Cost of Beating Moore's Law

- A "1" and "0" must have more than 100× the thermal energy to avoid errors
  - Lowering the temperature doesn't help, it just shifts power to the refrigerator
- Today's irreversible logic destroys "1"s and "0"s at each gate. However, "reversible computing" recycles the energy in "1"s and "0"s. There is no known limit to "reversible computing."
- Quantum computing offers the possibility of exponential speedups



#### Packaging for a Spatial Locality

- Basic Module
  - 2 Nodes
  - Each node is an ASIC / System On Chip / Processor In Memory
  - Each node has memory under ASIC
  - Each module includes a power module
  - Six mesh interconnects
- Modules connect end-to-end in "Shish Kabobs"





## Packaging for a Spatial Locality

- Entire supercomputer is a single structure
- All mesh network wires are of constant length (8" max)
- Air flows front to back
- General approach will work for liquid cooling as well





## **Nearest-Neighbor Interconnect**

- X Dimension
  - From one board to another lying in the same plane – 2"
- Y Dimension
  - From one board to another spaced above or below – 8"
- Z Dimension
  - Along the Shish Kabob – 4"
  - Name courtesy Monty Denneau, IBM





#### **Maintenance**



- Each "Shish Kabob" can be removed for maintenance
- Connects via side-connect technology
  - Similar to Cray shuttle connectors on T3E and X1
- Each Shish Kabob can be composed of segments to avoid limits on PC board technology
- Depth should be OK to 6'



### **Backup: Landauer's Arguments**

- Landauer makes three arguments in his 1961 paper
  - Kinetics of a bistable well
  - Entropy generation
- We review the second →

- Entropy of a system in statistical mechanics:

$$S = k_B \log_e(W)$$

W is number of states

- Entropy of a mechanical system containing a flip flop in an unknown state:

$$S = k_B \log_e(2W)$$

After clearing the flip flop:

$$S = k_B \log_e(W)$$

Difference k<sub>B</sub> log<sub>e</sub>(2)
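Written as one chain, and evaluated at an assumed room temperature of T = 300 K (the value used in the modeling code in this talk), the entropy difference sets the minimum heat released when the flip flop is cleared:

$$\Delta S = k_B \log_e(2W) - k_B \log_e(W) = k_B \log_e 2$$

$$Q \ge T \, \Delta S = k_B T \log_e 2 \approx 1.38 \times 10^{-23}\ \text{J/K} \times 300\ \text{K} \times 0.693 \approx 2.9 \times 10^{-21}\ \text{J}$$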



#### Backup: Landauer's Arguments II

- Second law of thermodynamics says entropy of universe must increase
  - Entropy is disorder
- Say you clear a computer memory of n bits. The computer's memory is initially disordered (arbitrary bits) but becomes ordered (all zero). Entropy goes down.

- However, entropy of universe must increase.
- Resolution is that the material of the memory chip becomes more disordered (hotter), offsetting the information in the memory
- A logic gate with multiple inputs but one output has fewer output states than input states: same idea



### Backup: k<sub>B</sub>T Should Not Be A Surprise

This logical irreversibility is associated with physical irreversibility and requires a minimal heat generation, per machine cycle, typically of the order of kT for each irreversible function.

- R. Landauer 1961



Irreversibility and Heat Generation in the Computing Process

kT "helper line," drawn out of the reader's focus because it wasn't important at the time of writing

Carver Mead, Scaling of MOS Technology, 1994





#### **Backup: Floating Point**

- A floating point unit has about 100,000 gates
- About 20,000 gates will switch for each operation
- Therefore,

E<sub>FLOP</sub> ≈ 20,000 E<sub>gate</sub> ≈ 2,000,000 k<sub>B</sub> T

- Landauer limit is: 100 TFLOPS/watt
- Accounting for engineering losses, more realistic:

10 TFLOPS/watt

- If a μP is 1% efficient, the probable limit for a microprocessor is:

100 GFLOPS/watt chip
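The Landauer-limit figure can be checked directly from the gate count. The sketch below assumes 100 k<sub>B</sub>T per gate switching event and T = 300 K, and uses the Boltzmann constant value from the talk's modeling code; the function name is illustrative.

```cpp
#include <cassert>

constexpr double kBoltzmann = 1.3806503e-23;  // J/K, value used in the modeling code

// Floating point operations per joule (equivalently FLOPS per watt) when each
// operation switches `gatesPerFlop` gates at `kTPerGate` * k_B * T per gate.
double flopsPerWatt(double gatesPerFlop, double kTPerGate, double temperatureK) {
    assert(gatesPerFlop > 0 && kTPerGate > 0 && temperatureK > 0);
    const double joulesPerFlop = gatesPerFlop * kTPerGate * kBoltzmann * temperatureK;
    return 1.0 / joulesPerFlop;
}
```

`flopsPerWatt(20000, 100, 300)` evaluates to about 1.2 × 10<sup>14</sup>, i.e. roughly 120 TFLOPS/watt, which the slide rounds to 100 TFLOPS/watt.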





- Minimum power per logic op 100 k<sub>B</sub>T
- Minimum power per FLOP 2×10<sup>6</sup> k<sub>B</sub>T
- Analysis
  - At any T, performance may depend on cooling
  - Cutting T won't save power because of offsetting power in refrigerator, but may make cooling system more efficient
- However
  - Applications modeling indicates DOE apps aren't especially dependent on cooling
- Conclusion: Use room temperature

## Backup: Authority on μP Efficiency

# Data parallelism realizes full potential of increased transistor count





Citation: Bill Dally, ASCI PI Meeting 2004






#### **Backup: Languages**

- For many years, computer languages have targeted higher programmer productivity, trading easy programming for higher resource consumption during execution. This was believed to be OK because Moore's Law would cut the excess cost over time. Not so anymore
- Need to study languages for mature "irreversible logic" computers that are both easy to use and avoid excessive use of resources



## **Backup Slide: Analog Computing**

- Floating Point Energy/Op
  - $20,000 \times 100 \times k_BT =$
  - $2\times10^6 k_BT$
- Analog Energy/Op
  - k<sub>B</sub>T log<sub>e</sub>("# states")
  - $k_BT \log_e(2^{64})$
  - 64 k<sub>B</sub>T log<sub>e</sub>2
  - $44 k_BT$
- Analog 45,000× more efficient

- Heisenberg Uncertainty Principle
  - $\Delta E \Delta t \geq h/(2\pi)$
- Waiting Time

$$\Delta E = 2^{-64} \times 64 k_B T \log_{e} 2$$

$$\Delta t \ge \frac{h}{2\pi \times 2^{-64} \times 64 k_B T \log_{e} 2}$$

- $\Delta t$ ≥ ~3 hours
- Analog really slow
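Both numbers on this slide can be reproduced with a short C++ sketch. It assumes T = 300 K and a 64-bit analog value; the function names are illustrative, and h/(2π) is taken as 1.054571817 × 10<sup>-34</sup> J·s.

```cpp
#include <cassert>
#include <cmath>

constexpr double kBoltzmann = 1.3806503e-23;  // J/K
constexpr double kHbar = 1.054571817e-34;     // J*s, h / (2*pi)

// Analog energy per operation in units of k_B*T: log_e(2^bits) = bits * log_e(2).
double analogEnergyKT(int bits) {
    assert(bits > 0);
    return bits * std::log(2.0);
}

// How many times less energy the analog operation needs than a digital one.
double digitalToAnalogRatio(double digitalKT, int bits) {
    return digitalKT / analogEnergyKT(bits);
}

// Heisenberg waiting time: delta_t >= hbar / delta_E, with
// delta_E = 2^-bits * bits * k_B * T * log_e(2).
double waitingTimeSeconds(int bits, double temperatureK) {
    assert(temperatureK > 0);
    const double deltaE =
        std::ldexp(1.0, -bits) * analogEnergyKT(bits) * kBoltzmann * temperatureK;
    return kHbar / deltaE;
}
```

For 64 bits the analog energy is about 44.4 k<sub>B</sub>T, the ratio against 2 × 10<sup>6</sup> k<sub>B</sub>T is about 45,000, and the waiting time comes to roughly 10,600 seconds, a little under 3 hours.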

